Convergence issues of gradient descent

Backpropagation

The divergence function being minimized is only a proxy for classification error (e.g. the softmax cross-entropy loss)
Minimizing the divergence may not minimize the classification error
- Backprop may fail to separate the points even though they are linearly separable
- This is because the separating solution is not necessarily an optimum of the loss function
Compare to the perceptron
- The perceptron rule has low bias (it makes no errors if that is possible)
  - But high variance (it swings wildly in response to small changes to the input)
- Backprop is minimally changed by new training instances
  - It prefers consistency over perfection (which is good)

Convergence

Univariate inputs

For quadratic surfaces

$\text {Minimize } E=\frac{1}{2} a w^{2}+b w+c$

$\mathrm{w}^{(k+1)}=\mathrm{w}^{(k)}-\eta \frac{d E\left(\mathrm{w}^{(k)}\right)}{d \mathrm{w}}$

Gradient descent with fixed step size $\eta$ to estimate scalar parameter $w$
Using Taylor expansion

$E(w) \approx E\left(w^{(k)}\right)+E^{\prime}\left(w^{(k)}\right)\left(w-w^{(k)}\right)+\frac{1}{2} E^{\prime\prime}\left(w^{(k)}\right)\left(w-w^{(k)}\right)^2$

Minimizing this quadratic approximation gives the optimum step size $\eta_{opt} = E^{\prime\prime}(w^{(k)})^{-1}$
- For $\eta < \eta_{opt}$ the algorithm will converge monotonically
- For $2\eta_{opt} > \eta > \eta_{opt}$ , we have oscillating convergence
- For $\eta > 2\eta_{opt}$ , we get divergence
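The three regimes above can be checked numerically. The following is a minimal sketch, assuming an illustrative quadratic with $a = 2$, $b = -4$ (values chosen here for the example, not taken from the notes); the function name `gradient_descent_1d` is hypothetical.

```python
def gradient_descent_1d(a, b, eta, w0, steps):
    """Fixed-step gradient descent on E(w) = 0.5*a*w**2 + b*w + c."""
    w = w0
    trajectory = [w]
    for _ in range(steps):
        grad = a * w + b          # dE/dw for the quadratic
        w = w - eta * grad
        trajectory.append(w)
    return trajectory

# Illustrative values (assumed for this example only).
a, b = 2.0, -4.0
w_star = -b / a                   # location of the minimum
eta_opt = 1.0 / a                 # inverse of the second derivative E'' = a

for label, eta in [("monotonic", 0.5 * eta_opt),
                   ("one step", eta_opt),
                   ("oscillating", 1.5 * eta_opt),
                   ("diverging", 2.5 * eta_opt)]:
    traj = gradient_descent_1d(a, b, eta, w0=0.0, steps=5)
    errors = [round(w - w_star, 4) for w in traj]
    print(f"{label:12s} eta={eta:.2f} errors={errors}")
```

The error is multiplied by $(1 - \eta a)$ at every step, which is why the sign of the error flips once $\eta > \eta_{opt}$ and its magnitude grows once $\eta > 2\eta_{opt}$.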
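For generic differentiable convex objectives
- A second-order Taylor expansion gives a local quadratic approximation around $w^{(k)}$
- Taking the optimum step for that approximation at every iteration is Newton's method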

$\eta_{opt}=\left(\frac{d^{2} E\left(\mathrm{w}^{(k)}\right)}{d w^{2}}\right)^{-1}$
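As a rough illustration of this step-size rule, the sketch below applies it to a convex but non-quadratic objective. The objective $E(w) = w^2 + e^w$ and the helper name `newton_1d` are assumptions made for this example only.

```python
import math

def newton_1d(d1, d2, w0, steps=20):
    """Scalar Newton iteration: step size is the inverse second derivative."""
    w = w0
    for _ in range(steps):
        eta = 1.0 / d2(w)         # local eta_opt at the current iterate
        w = w - eta * d1(w)
    return w

# Assumed example objective E(w) = w**2 + exp(w), which is strictly convex.
d1 = lambda w: 2.0 * w + math.exp(w)   # first derivative
d2 = lambda w: 2.0 + math.exp(w)       # second derivative, always positive

w_min = newton_1d(d1, d2, w0=3.0)
print(w_min, d1(w_min))                # derivative should be close to zero
```

Because the objective is not quadratic, $\eta_{opt}$ changes from one iterate to the next and has to be recomputed at each step.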

Multivariate inputs

Quadratic convex function

$E=\frac{1}{2} \mathbf{w}^{T} \mathbf{A} \mathbf{w}+\mathbf{w}^{T} \mathbf{b}+c$

If $A$ is diagonal

$E=\sum_{i}\left(\frac{1}{2} a_{i i} w_{i}^{2}+b_{i} w_{i}\right)+c$

We can optimize each coordinate independently
- For example, $\eta_{1,opt} = a^{-1}_{11}$ , $\eta_{2,opt} = a^{-1}_{22}$
- But the optimal learning rate differs across coordinates
If gradient descent updates the entire vector with a single learning rate, convergence in every coordinate requires

$\eta < 2 \min_i \eta_{i,opt}$

This, however, makes the learning very slow if $\frac{\max_i \eta_{i,opt}}{\min_i\eta_{i,opt}}$ is large
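The slowdown can be seen in a small experiment. This is a sketch under assumed values: a diagonal quadratic with $a_{11} = 1$, $a_{22} = 100$ and $\mathbf{b} = 0$, so the ratio of optimal learning rates is 100. The function name `gd_shared_rate` is hypothetical.

```python
def gd_shared_rate(a_diag, eta, w0, steps):
    """Gradient descent on E = 0.5 * sum(a_i * w_i**2) with one shared eta."""
    w = list(w0)
    for _ in range(steps):
        # The gradient along coordinate i is a_i * w_i.
        w = [wi - eta * ai * wi for wi, ai in zip(w, a_diag)]
    return w

# Assumed curvatures; per-coordinate optimal rates are 1.0 and 0.01.
a_diag = [1.0, 100.0]
eta_max = 2.0 * min(1.0 / ai for ai in a_diag)   # stability bound on eta

# eta = 0.01 is optimal for the steep coordinate but tiny for the flat one.
slow = gd_shared_rate(a_diag, eta=0.01, w0=[1.0, 1.0], steps=100)
# eta = 0.03 exceeds the bound, so the steep coordinate blows up.
unstable = gd_shared_rate(a_diag, eta=0.03, w0=[1.0, 1.0], steps=20)
print(eta_max, slow, unstable)
```

With the safe rate, the steep coordinate is solved immediately while the flat coordinate has shrunk by less than a factor of three after 100 steps.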
Solution: Normalize the objective to have identical eccentricity in all directions
- Then all of them will have identical optimal learning rates
- Easier to find a working learning rate
Target

$E=\frac{1}{2} \widehat{\mathbf{w}}^{T} \widehat{\mathbf{w}}+\hat{\mathbf{b}}^{T} \widehat{\mathbf{w}}+c$

So let $\widehat{\mathbf{w}}=\mathbf{S} \mathbf{w}$ with $\mathbf{S} = \mathbf{A}^{0.5}$ , which gives $\hat{\mathbf{b}} = \mathbf{A}^{-0.5}\mathbf{b}$
Gradient descent rule

$\widehat{\mathbf{w}}^{(k+1)}=\widehat{\mathbf{w}}^{(k)}-\eta \nabla_{\widehat{\mathbf{w}}} E\left(\widehat{\mathbf{w}}^{(k)}\right)^{T}$

$\mathbf{w}^{(k+1)}=\mathbf{w}^{(k)}-\eta \mathbf{A}^{-1} \nabla_{\mathbf{w}} E\left(\mathbf{w}^{(k)}\right)^{T}$

So we just need to calculate $\mathbf{A}^{-1}$ , and the optimal step size is then the same in every direction (namely 1)
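A minimal sketch of the normalized update on a two-dimensional quadratic follows. The matrix $\mathbf{A}$, the vector $\mathbf{b}$, the starting point, and all helper names are assumptions for illustration; the 2x2 inverse is written out by hand to keep the example free of dependencies.

```python
def matvec(M, v):
    """Multiply a 2x2 matrix by a 2-vector."""
    return [M[0][0] * v[0] + M[0][1] * v[1],
            M[1][0] * v[0] + M[1][1] * v[1]]

def inverse_2x2(M):
    """Closed-form inverse of a 2x2 matrix."""
    det = M[0][0] * M[1][1] - M[0][1] * M[1][0]
    return [[M[1][1] / det, -M[0][1] / det],
            [-M[1][0] / det, M[0][0] / det]]

def gradient(A, b, w):
    """Gradient of E = 0.5 * w^T A w + w^T b + c, which is A w + b."""
    Aw = matvec(A, w)
    return [Aw[0] + b[0], Aw[1] + b[1]]

def normalized_step(A, b, w, eta=1.0):
    """One update w <- w - eta * A^{-1} * grad E(w)."""
    step = matvec(inverse_2x2(A), gradient(A, b, w))
    return [w[0] - eta * step[0], w[1] - eta * step[1]]

# Assumed symmetric positive definite A and arbitrary b.
A = [[3.0, 1.0], [1.0, 2.0]]
b = [-1.0, 4.0]
w1 = normalized_step(A, b, w=[5.0, -7.0])
print(w1, gradient(A, b, w1))     # gradient should be numerically zero
```

With $\eta = 1$ a single step lands on the minimum regardless of the starting point, because the scaled problem has the same curvature in every direction.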
For generic differentiable multivariate convex functions
- Also use a Taylor expansion
- $E(\mathbf{w}) \approx E\left(\mathbf{w}^{(k)}\right)+\nabla_{\mathbf{w}} E\left(\mathbf{w}^{(k)}\right)\left(\mathbf{w}-\mathbf{w}^{(k)}\right)+\frac{1}{2}\left(\mathbf{w}-\mathbf{w}^{(k)}\right)^{T} H_{E}\left(\mathbf{w}^{(k)}\right)\left(\mathbf{w}-\mathbf{w}^{(k)}\right)+\cdots$
- We get the normalized update rule
- $\mathbf{w}^{(k+1)}=\mathbf{w}^{(k)}-\eta H_{E}\left(\mathbf{w}^{(k)}\right)^{-1} \nabla_{\mathbf{w}} E\left(\mathbf{w}^{(k)}\right)^{T}$
- Each step uses the local quadratic approximation to estimate the minimum
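The Hessian-normalized update can be sketched on a small convex, non-quadratic objective. The objective $E(\mathbf{w}) = e^{w_1 + w_2} + w_1^2 + w_2^2$, the starting point, and the helper names below are assumptions made for this example.

```python
import math

def grad_and_hessian(w):
    """Gradient and Hessian of the assumed E(w) = exp(w1 + w2) + w1**2 + w2**2."""
    e = math.exp(w[0] + w[1])
    grad = [e + 2.0 * w[0], e + 2.0 * w[1]]
    hess = [[e + 2.0, e], [e, e + 2.0]]
    return grad, hess

def solve_2x2(M, v):
    """Solve M x = v for a 2x2 matrix using Cramer's rule."""
    det = M[0][0] * M[1][1] - M[0][1] * M[1][0]
    return [(M[1][1] * v[0] - M[0][1] * v[1]) / det,
            (M[0][0] * v[1] - M[1][0] * v[0]) / det]

def newton_2d(w0, eta=1.0, steps=20):
    """Iterate w <- w - eta * H^{-1} * grad, recomputing H at every step."""
    w = list(w0)
    for _ in range(steps):
        grad, hess = grad_and_hessian(w)
        step = solve_2x2(hess, grad)
        w = [w[0] - eta * step[0], w[1] - eta * step[1]]
    return w

w_min = newton_2d([1.0, -0.5])
print(w_min, grad_and_hessian(w_min)[0])   # gradient should be near zero
```

The linear system is solved directly rather than forming the inverse Hessian, which is the usual choice when the matrix is only needed once per step.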

Issues

Hessian

For complex models such as neural networks, with a very large number of parameters, the Hessian is extremely difficult to compute
For non-convex functions, the Hessian may not be positive semi-definite, in which case the algorithm can diverge
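A one-dimensional sketch shows how negative curvature misleads the second-order update. The objective $E(w) = \cos(w)$ and the function names are assumed for illustration. Near $w = 0$ the second derivative is negative, and in this example the Newton iteration heads for the maximum rather than diverging, while plain gradient descent moves downhill.

```python
import math

def newton_on_cos(w0, steps=10):
    """Newton iteration on the assumed non-convex E(w) = cos(w)."""
    w = w0
    for _ in range(steps):
        d1 = -math.sin(w)         # E'(w)
        d2 = -math.cos(w)         # E''(w), negative near w = 0
        w = w - d1 / d2
    return w

def gd_on_cos(w0, eta=0.1, steps=200):
    """Plain gradient descent on the same objective, for comparison."""
    w = w0
    for _ in range(steps):
        w = w - eta * (-math.sin(w))
    return w

w_newton = newton_on_cos(0.5)     # drifts to w = 0, a maximum of cos
w_gd = gd_on_cos(0.5)             # drifts toward w = pi, a minimum of cos
print(w_newton, math.cos(w_newton), w_gd, math.cos(w_gd))
```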

Learning rate

For complex models such as neural networks the loss function is often not convex
- $\eta > 2\eta_{opt}$ can actually help escape local optima
However, always keeping $\eta > 2\eta_{opt}$ ensures that you never actually settle on a solution
Solution: use a decaying learning rate
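One way to picture a decaying learning rate is the sketch below. The inverse-time schedule $\eta_k = \eta_0 / (1 + \text{decay} \cdot k)$, the quadratic $E(w) = \frac{1}{2} a w^2$, and the name `gd_with_decay` are assumptions for this example; the notes do not specify a particular schedule.

```python
def gd_with_decay(a, w0, eta0, decay, steps):
    """Gradient descent on E(w) = 0.5*a*w**2 with a shrinking step size."""
    w = w0
    for k in range(steps):
        # Assumed inverse-time schedule; decay = 0 gives a fixed step size.
        eta = eta0 / (1.0 + decay * k)
        w = w - eta * a * w
    return w

a = 1.0                           # eta_opt = 1, so eta0 = 2.5 exceeds 2 * eta_opt
w_fixed = gd_with_decay(a, w0=1.0, eta0=2.5, decay=0.0, steps=50)
w_decay = gd_with_decay(a, w0=1.0, eta0=2.5, decay=0.1, steps=50)
print(w_fixed, w_decay)
```

The fixed rate diverges, whereas the decayed rate overshoots for the first few iterations and then falls below $2\eta_{opt}$, after which the iterates contract.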
